The Art of SEO : Duplicate Content Issues (part 2) - How Search Engines Identify Duplicate Content

12/14/2010 3:39:43 PM

2. How Search Engines Identify Duplicate Content

Some examples will illustrate the process for Google as it finds duplicate content on the Web. In the examples shown in Figures Figure 1 through Figure 4, three assumptions have been made:

The page with text is assumed to be a page containing duplicate content (not just a snippet, despite the illustration).
Each page of duplicate content is presumed to be on a separate domain.
The steps that follow have been simplified to make the process as easy and clear as possible. This is almost certainly not the exact way in which Google performs (but it conveys the effect).

Figure 1. Google finding duplicate content

Figure 2. Google comparing the duplicate content to the other copies

Figure 3. Duplicate copies getting tossed out

Figure 4. Google choosing one as the original

There are a few facts about duplicate content that bear mentioning as they can trip up webmasters who are new to the duplicate content issue:

Location of the duplicate content

Is it duplicated content if it is all on my site? Yes, in fact, duplicate content can occur within a site or across different sites.

Percentage of duplicate content

What percentage of a page has to be duplicated before you run into duplicate content filtering? Unfortunately, the search engines would never reveal this information because it would compromise their ability to prevent the problem.

It is also a near certainty that the percentage at each engine fluctuates regularly and that more than one simple direct comparison goes into duplicate content detection. The bottom line is that pages do not need to be identical to be considered duplicates.

Ratio of code to text

What if your code is huge and there are very few unique HTML elements on the page? Will Google think the pages are all duplicates of one another? No. The search engines do not really care about your code; they are interested in the content on your page. Code size becomes a problem only when it becomes extreme.

Ratio of navigation elements to unique content

Every page on my site has a huge navigation bar, lots of header and footer items, but only a little bit of content; will Google think these pages are duplicates? No. Google (and Yahoo! and Bing) factor out the common page elements such as navigation before evaluating whether a page is a duplicate. They are very familiar with the layout of websites and recognize that permanent structures on all (or many) of a site’s pages are quite normal. Instead, they’ll pay attention to the “unique” portions of each page and often will largely ignore the rest.

Licensed content

What should I do if I want to avoid duplicate content problems, but I have licensed content from other web sources to show my visitors? Use meta name = "robots" content="noindex, follow". Place this in your page’s header and the search engines will know that the content isn’t for them. This is a general best practice, because then humans can still visit the page, link to it, and the links on the page will still carry value.

Another alternative is to make sure you have exclusive ownership and publication rights for that content.